Title of the Invention

Low-Overhead Threads in a High-Concurrency System

Background of the Invention

This application claims the benefit of U.S. Provisional Application No. 60/195,732, filed 4/7/00 (Attorney Docket number 103.1032.01).

Field of the Invention

This invention relates to low-overhead threads in a high-concurrency system, such as for a networked cache or file server.

Related Art



In many computing systems, it is desirable in certain circumstances to be able to process, relatively simultaneously (such as in parallel), a relatively large number of similar tasks. For example, the same or similar tasks could be performed by a server device (such as a file server) in response to requests by a number of client devices. One such circumstance is in a networked cache or file server, which maintains and processes a relatively large number of sequences of requests (sometimes called "connections"), so as to couple an information requester (such as a web client) to one or more information providers, which are also coupled to the same internetworking system. One known method in which an individual processor or a multiprocessor system is able to maintain a high degree of concurrency is for the system to process each connection using a separate processing thread. A "thread" is a locus of control within a process, indicating a spot within that process that the processor is then currently executing. In general, a thread has a relatively small amount of state information associated therewith, generally consisting only of a calling stack and a relatively small number of local variables.



High-concurrency systems, such as networked caches and file servers used in an internetworking system, must generally maintain a large number of threads. Each information requester has its own separate connection for which the networked cache or file server must maintain some amount of state information. Each such separate connection requires only a small amount of state information, such as approximately 100 to 200 bytes of information. Since there are in many cases a relatively large number of individual connections, it would be desirable to be able to maintain state information about each such connection using only a relatively minimal amount of memory and processor overhead, while simultaneously maintaining both relatively reliable programmability and relatively high processing speed.

One problem with known systems is that allocation of state information for individual threads does not generally scale well. One of the problems with relatively large numbers of individual threads is that of allocating memory space for a calling stack for each one of those threads. In a first set of known systems, stack space for individual threads is allocated statically; this has the drawback that relatively large numbers of threads require a relatively large amount of memory to maintain all such stack spaces. Although the amount of stack space statically allocated for each individual thread can be reduced significantly, this has the drawback that operations that can be performed by each individual thread are similarly significantly restricted. In a second set of known systems, stack space for individual threads is allocated dynamically; this has the drawback that the minimum size for dynamic allocation of memory is generally measured in kilobytes, resulting in substantial unnecessary memory overhead. Although virtual memory can be used to store and retrieve stack space for individual threads in smaller increments, this has the drawback that compression and decompression of stack space for individual threads imposes substantial unnecessary processor overhead. In a third set of known systems, such as those using the Java programming language, dynamic memory allocation is used to store and retrieve stack space for individual threads; this has the drawback that each procedure call within each thread imposes substantial unnecessary processor overhead.

An additional problem is introduced by the particular use made of multithreading by the WAFL file system (as described in the Incorporated Disclosures). In the WAFL file system, the C language "setjmp" and "longjmp" routines are combined with message passing among threads so as to support high concurrency using threads. In particular, the requester of an initial file request to the WAFL file system packages the request in a message, which the WAFL file system processes using ordinary procedural program code, so long as data is available for processing the request and the thread need not have its execution suspended. If the thread is suspended for any reason (such as if a resource is not available), the WAFL file system: (1) requests the needed resource, (2) queues the message for signaling when the resource is available, and (3) calls the C routine "longjmp" to return to the origin of the routine for processing the message. Thus, the WAFL file system restarts processing the entire message from the very beginning until all needed resources are available and processing can complete without suspension. While this use of multithreading by the WAFL file system has the advantage that programmers do not need to encode program state when a routine is suspended, it has the disadvantage, when combined with multithreading, that all necessary data structures (to process any arbitrary message) must be collected before the entire message can be processed. In an internetworking environment, collecting all such structures can be difficult and subject to error.
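The restart-from-the-beginning pattern described above can be illustrated with a minimal sketch. It is not WAFL program code: the names "dispatch", "handle_message", "need_resource", and "resource_arrived" are hypothetical, and a single flag stands in for the needed resource. The handler is written as ordinary procedural code, and a missing resource abandons it by way of "longjmp" without saving any program state.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; this is an illustration, not WAFL code. */
static jmp_buf dispatch_env;   /* where an abandoned handler jumps back to */
static bool resource_ready;    /* stands in for "the needed data is available" */
static int handler_entries;    /* counts how often the handler starts over */

/* Abandon the current handler if the resource is not yet available. */
static void need_resource(void)
{
    if (!resource_ready)
        longjmp(dispatch_env, 1);   /* unwind to dispatch(); no state is kept */
}

/* Ordinary procedural code, with no explicit state machine. */
static int handle_message(int request)
{
    handler_entries++;
    need_resource();                /* may not return */
    return request * 2;             /* the work, done once the resource exists */
}

/* Returns true and stores the result if the message ran to completion;
 * returns false if the handler was abandoned and must be run again later. */
bool dispatch(int request, int *result)
{
    assert(result != NULL);
    if (setjmp(dispatch_env) != 0)
        return false;
    *result = handle_message(request);
    return true;
}

void resource_arrived(void) { resource_ready = true; }
int entries(void) { return handler_entries; }
```

In this sketch a second call to "dispatch" re-executes the handler from its first statement, which is the behavior the paragraph above identifies as the drawback.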
Accordingly, it would be advantageous to provide a technique for creating and using relatively low-overhead threads in a high-concurrency system, such as for a networked cache or file server, that is not subject to drawbacks of the known art.

Summary of the Invention

The invention provides a method and system for providing the functionality of dynamically-allocated threads in a multithreaded system in which the operating system provides only statically-allocated threads. With this functionality, a relatively large number of threads can be maintained without a relatively large amount of overhead (either in memory or processor time), and it remains possible to produce program code without undue complexity.

In a preferred embodiment, a plurality of dynamically-allocated threads are simulated using a single statically-allocated thread, but with state information regarding each dynamically-allocated thread maintained within the single statically-allocated thread. The single statically-allocated thread includes, for each procedure call that would otherwise introduce a new dynamically-allocated thread, a memory block including: (1) a relatively small procedure call stack for the new dynamically-allocated thread, and (2) a relatively small collection of local variables and other state information for the new dynamically-allocated thread. When using multithreading in the WAFL file system, high concurrency among threads can be maintained without any particular requirement that the program code maintain a substantial amount of state information regarding each dynamically-allocated thread. Each routine in the WAFL file system that expects to be suspended or interrupted need maintain only a collection of entry points into which the routine is re-entered when the suspension or interruption is completed. A feature of the C language preprocessor allows the programmer to generate each of these entry points without substantial additional programming work, with the aid of one or more programming macros.

The invention provides an enabling technology for a wide variety of applications for multithreaded systems so as to obtain substantial advantages and capabilities that are novel and non-obvious in view of the known art. Examples described below primarily relate to networked caches and file servers, but the invention is broadly applicable to many different types of automated software systems.

Brief Description of the Drawings

Figure 1 shows a block diagram of a system for providing functionality of low-overhead threads in a high-concurrency system, such as for a networked cache or file server.

Figure 2 shows a process flow diagram of a system for providing functionality of low-overhead threads in a high-concurrency system, such as for a networked cache or file server.



Detailed Description of the Preferred Embodiment 

In the following description, a preferred embodiment of the invention is described with regard to preferred process steps and data structures. Embodiments of the invention can be implemented using general-purpose processors or special-purpose processors operating under program control, or other circuits, adapted to particular process steps and data structures described herein. Implementation of the process steps and data structures described herein would not require undue experimentation or further invention.



Lexicography 



The following terms refer or relate to aspects of the invention as described below. The descriptions of general meanings of these terms are not intended to be limiting, only illustrative.



• client and server: In general, these terms refer to a relationship between two devices, particularly to their relationship as client and server, not necessarily to any particular physical devices.

  For example, but without limitation, a particular client device in a first relationship with a first server device can serve as a server device in a second relationship with a second client device. In a preferred embodiment, there are generally a relatively small number of server devices servicing a relatively larger number of client devices.

• client device and server device: In general, these terms refer to devices taking on the role of a client device or a server device in a client-server relationship (such as an HTTP web client and web server). There is no particular requirement that any client devices or server devices must be individual physical devices. They can each be a single device, a set of cooperating devices, a portion of a device, or some combination thereof.

  For example, but without limitation, the client device and the server device in a client-server relation can actually be the same physical device, with a first set of software elements serving to perform client functions and a second set of software elements serving to perform server functions.

As noted above, these descriptions of general meanings of these terms are not intended to be limiting, only illustrative. Other and further applications of the invention, including extensions of these terms and concepts, would be clear to those of ordinary skill in the art after perusing this application. These other and further applications are part of the scope and spirit of the invention, and would be clear to those of ordinary skill in the art, without further invention or undue experimentation.

System Elements

Figure 1 shows a block diagram of a system for providing functionality of low-overhead threads in a high-concurrency system, such as for a networked cache or file server.

A system 100 includes a networked cache or file server (or other device) 110, a sequence of input request messages 120, and a set of software elements 130.

The networked cache or file server (or other device) 110 includes a computer having a processor, program and data memory, mass storage, a presentation element, and an input element, and is coupled to a communication network. As used herein, the term "computer" is intended in its broadest sense, and includes any device having a programmable processor or otherwise falling within the generalized Turing machine paradigm. The mass storage can include any device for storing relatively large amounts of information, such as magnetic disks or tapes, optical devices, magneto-optical devices, or other types of mass storage.
The input request messages 120 include a set of messages requesting the networked cache or file server 110 to perform actions in response thereto. In a preferred embodiment, the actions to be performed by the networked cache or file server 110 will involve access to the mass storage or to the communication network. In a preferred embodiment, the input request messages 120 are formatted in a known request protocol, such as NFS, CIFS, HTTP (or variants thereof), but there is no particular requirement for the input request messages 120 to use these known request protocols or any other known request protocols. In a preferred embodiment, the networked cache or file server 110 responds to the input request messages 120 with both: (1) a condign set of responsive actions involving the mass storage or the communication network, and (2) a condign response to the input request messages 120, the response to the input request messages 120 preferably taking the form of a set of response messages (not shown).

The software elements 130 include a set of programmed routines to be performed by the networked cache or file server 110, using the functionality of low-overhead threads and high-concurrency as described herein. Although particular program code is described herein with regard to the programmed routines, there is no particular reason that the software elements 130 must use the specific program code described herein, or any other specific program code.

Method of Operation

Figure 2 shows a process flow diagram of a system for providing functionality of low-overhead threads in a high-concurrency system, such as for a networked cache or file server.

A method 200 includes a set of flow points and a set of steps. The system 100 performs the method 200. Although the method 200 is described serially, the steps of the method 200 can be performed by separate elements in conjunction or in parallel, whether asynchronously, in a pipelined manner, or otherwise. There is no particular requirement that the method 200 be performed in the same order in which this description lists the steps, except where so indicated.

At a flow point 210, the networked cache or file server 110 is ready to receive and respond to the input request messages 120.

At a step 211, the networked cache or file server 110 receives an input request message 120, and forwards that input request message 120 to an appropriate software element 130 for processing. In a preferred embodiment, the step 211 includes performing a calling sequence for the software element 130, including possibly creating a simulated dynamically-allocated thread (that is, a thread simulated so as to appear to be dynamically-allocated, hereinafter sometimes called a "simulated thread" or an "S-thread") within which the software element 130 is performed. Thus, the software element 130 can be created using program code that assumes that the software element 130 is performed by a separate thread and does not demand relatively excessive resources (either memory or processor time).

As part of step 211, the networked cache or file server 110 allocates a procedure call block 131 and a local variable block 132, for use by the simulated dynamically-allocated thread performed by the software element 130. The procedure call block 131 includes a set of input variables for input to the software element 130, a set of output variables for output from the software element 130, and such other stack elements as are known in the art of calling stacks for procedure calls. The local variable block 132 includes a set of locations in which to store local variables for the software element 130.

As part of step 211, the networked cache or file server 110 determines whether the software element 130 is a subroutine of a previously called software element 130 in the same simulated thread. If so, the networked cache or file server 110 indicates that fact in a block header 133 for the software element 130, so as to point back to the particular software element 130 that was the parent (calling) software element 130. If not, the networked cache or file server 110 does not indicate that fact in the block call or block header for the software element 130.
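One way to picture the blocks allocated in step 211 is the following minimal sketch. The layout is an assumption made for illustration only: the type "call_block", its field names, and the helper functions are hypothetical, and the procedure call block 131, local variable block 132, and block header 133 are folded into a single allocation for brevity.

```c
#include <assert.h>
#include <stdlib.h>

/* All type and field names here are illustrative assumptions. */
typedef struct call_block {
    struct call_block *parent;  /* block header 133: calling element, or NULL */
    int in_arg;                 /* procedure call block 131: an input variable */
    int out_result;             /* procedure call block 131: an output variable */
    int locals[4];              /* local variable block 132 */
} call_block;

/* Allocate a zeroed block; pass NULL as parent for a top-level element. */
call_block *call_block_new(call_block *parent, int in_arg)
{
    call_block *b = calloc(1, sizeof *b);
    if (b == NULL)
        return NULL;
    b->parent = parent;
    b->in_arg = in_arg;
    return b;
}

/* Number of blocks in the simulated call chain ending at b. */
int call_block_depth(const call_block *b)
{
    int depth = 0;
    for (; b != NULL; b = b->parent)
        depth++;
    return depth;
}

/* Model a return from a subroutine: release the block, yield its caller. */
call_block *call_block_return(call_block *b)
{
    assert(b != NULL);
    call_block *parent = b->parent;
    free(b);
    return parent;
}
```

Because each block points back to its caller, a chain of such blocks plays the role of a calling stack without reserving a fixed stack region per simulated thread.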
As part of this step, the networked cache or file server 110 determines whether the software element 130 is to be performed by a new simulated thread. If so, the networked cache or file server 110 adds the new thread block 134 to a linked list 135 of thread blocks 134 to be performed in turn according to a scheduler. In a preferred embodiment, the scheduler simply performs each simulated thread corresponding to the next thread block 134 in round-robin sequence, so that each simulated thread corresponding to a thread block 134 is performed in its turn, until it is suspended or completes. However, in alternative embodiments, the scheduler may select simulated threads in other than a round-robin sequence, so as to achieve a desired measure of quality of service, or other administrative goals.
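The round-robin scheduling described above might look like the following minimal sketch. It rests on stated assumptions: "thread_block", "thread_list", and the function names are hypothetical, and a simple countdown ("turns_left") stands in for a simulated thread that is suspended some number of times before it completes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a thread block 134. */
typedef struct thread_block {
    struct thread_block *next;  /* link in the list of thread blocks */
    int id;
    int turns_left;             /* turns needed before the thread completes */
} thread_block;

/* Hypothetical stand-in for the linked list 135. */
typedef struct {
    thread_block *head;
    thread_block *tail;
} thread_list;

/* Append a thread block to the tail of the list. */
void thread_list_add(thread_list *l, thread_block *t)
{
    assert(l != NULL && t != NULL);
    t->next = NULL;
    if (l->tail != NULL)
        l->tail->next = t;
    else
        l->head = t;
    l->tail = t;
}

/* Remove and return the block at the head of the list, or NULL if empty. */
static thread_block *thread_list_take(thread_list *l)
{
    thread_block *t = l->head;
    if (t != NULL) {
        l->head = t->next;
        if (l->head == NULL)
            l->tail = NULL;
    }
    return t;
}

/* Perform every thread in round-robin order until all have completed.
 * Each turn stands in for "run until suspended or complete".  The ids of
 * completed threads are written to done[] in completion order, and the
 * number written is returned. */
int schedule_round_robin(thread_list *l, int *done, int max_done)
{
    int n = 0;
    thread_block *t;
    while ((t = thread_list_take(l)) != NULL) {
        t->turns_left--;                /* perform the thread for one turn */
        if (t->turns_left > 0)
            thread_list_add(l, t);      /* suspended: back of the line */
        else if (n < max_done)
            done[n++] = t->id;          /* completed */
    }
    return n;
}
```

A scheduler pursuing quality-of-service goals would replace only the choice made in "thread_list_take"; the rest of the loop would be unchanged.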

At a step 212, the networked cache or file server 110 chooses the simulated thread for execution. The simulated thread, with appropriate data completed for the procedure call block 131 and local variable block 132, is performed in its turn, until it is suspended or completes. If the simulated thread is capable of completing its operation without being suspended or interrupted, the scheduler selects the next thread block 134 in the linked list of thread blocks 134 to be performed in turn.

After this step, the method 200 has performed one round of receiving and responding to input request messages 120, and is ready to perform another such round so as to continuously receive and respond to input request messages 120.
The method 200 is performed one or more times starting from the flow point 210 and continuing therefrom. In a preferred embodiment, the networked cache or file server 110 repeatedly performs the method 200, starting from the flow point 210 and continuing therefrom, so as to receive and respond to input request messages 120 periodically and continuously.

Program Structures



A set of program structures in a system for providing functionality of low-overhead threads in a high-concurrency system, such as for a networked cache or file server, includes one or more of, or some combination of, the following:

• A set of program structures for declaring and creating a dynamically-allocated thread in a system in which threads are usually statically-allocated;

    typedef struct {
        // local variables
        int arg;    // an example, not necessary
    } function_msg;

In the program structure above, the definition for the structure type "function_msg" includes: (1) the local variables for the dynamically-allocated thread, (2) any input arguments to the dynamically-allocated thread, in this case just the one variable "arg", and (3) any output arguments from the dynamically-allocated thread, in this case none.



• A set of program structures for denoting program code entry-points for a simulated thread;

    static void
    function_sthread(sthread_msg *m)
    {
        function_msg * const msg = m->data;

        STHREAD_START_BLOCK(m);
        // executable C code
        STHREAD_RESTART_POINT(m);         // an example blocking point
        // executable C code
        STHREAD_COND_WAIT(m, cond(m));    // encapsulated blocking point
        // executable C code
        STHREAD_END_BLOCK;
        free(msg);
    }

The program structure above includes, in its definition for the function "function_sthread", an initial program statement obtaining access to the local variables for the simulated thread. This is the statement referring to "m->data".

The program structure above includes a definition for a start-point for the simulated thread. This is the statement "STHREAD_START_BLOCK(m)", which makes use of a macro defined for the name "STHREAD_START_BLOCK".

The program structure above includes a definition for a restart-point for the simulated thread. This is the statement "STHREAD_RESTART_POINT(m)", which makes use of a macro defined for the name "STHREAD_RESTART_POINT".

The program structure above includes a definition for a conditional-wait point (a possible suspension of the simulated thread) for the simulated thread. This is the statement "STHREAD_COND_WAIT(m, cond(m))", which makes use of a macro defined for the name "STHREAD_COND_WAIT".

The program structure above includes, in its definition for the function "function_sthread", a closing program statement for ending the simulated thread. This is the statement "STHREAD_END_BLOCK", which makes use of a macro defined for the name "STHREAD_END_BLOCK". The program structure above also includes a statement for freeing any data structures used by the simulated thread. This is the statement "free(msg)".
The macro definitions for "STHREAD_START_BLOCK", "STHREAD_RESTART_POINT", and "STHREAD_END_BLOCK" collectively form a C language "switch" statement (referred to herein as a "case" statement).

• The macro "STHREAD_START_BLOCK" includes the preamble to the "case" statement:

    #define STHREAD_START_BLOCK(m) switch ((m)->line) { case 0:

• The macro "STHREAD_RESTART_POINT" includes an intermediate restart point in the "case" statement:

    #define STHREAD_RESTART_POINT(m) case __LINE__: (m)->line = __LINE__

The restart point uses the C preprocessor to generate tags that the switch statement uses as branch points. The C macro __LINE__ substitutes the line number of the file being processed, so a series of restart points generates a series of unique cases within the switch (provided that no two restart points share a source line). Setting m->line to the case just entered means that if the procedure is re-entered the switch statement will branch to the restart point and continue.

• The macro "STHREAD_END_BLOCK" includes the close of the "case" statement:

    #define STHREAD_END_BLOCK }

Thus, the C preprocessor generates a "case" statement in response to use of these macros, which allows the programmer to easily specify each of the proper restart points of the routine.
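The effect of the start, restart, and end macros can be seen in a small self-contained sketch. The routine "worker", the fields "setup_runs" and "finished", and the parameter "resource_ready" are hypothetical and exist only to make the re-entry visible; the routine leaves with an ordinary "return" rather than with the suspension mechanism of this description.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message type: state that must survive re-entry lives here,
 * because ordinary local variables are lost when the routine returns. */
typedef struct {
    int line;        /* restart point to branch to; 0 means "from the top" */
    int setup_runs;  /* how many times the first section has executed */
    int finished;    /* set once the last section has executed */
} sthread_msg;

#define STHREAD_START_BLOCK(m)   switch ((m)->line) { case 0:
#define STHREAD_RESTART_POINT(m) case __LINE__: (m)->line = __LINE__
#define STHREAD_END_BLOCK        }

/* Returns 1 when the routine ran to its end, 0 when it left early. */
int worker(sthread_msg *m, int resource_ready)
{
    assert(m != NULL);
    STHREAD_START_BLOCK(m);
    m->setup_runs++;             /* first section: runs on first entry only */
    STHREAD_RESTART_POINT(m);
    if (!resource_ready)
        return 0;                /* leave; the next call resumes just above */
    m->finished = 1;             /* last section */
    STHREAD_END_BLOCK;
    return 1;
}
```

Calling "worker" repeatedly with a zero-initialized message and "resource_ready" equal to 0 executes the first section once only; each later call branches directly to the restart point recorded in "line".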

• A set of program structures for suspending and restarting simulated threads;

    #define STHREAD_COND_WAIT(m, c) \
        STHREAD_RESTART_POINT(m);   \
        {                           \
            if (c)                  \
                sthread_suspend();  \
        }

At an individual restart point, the programmer can use the macro "STHREAD_COND_WAIT" to conditionally either wait for an operation to complete, or to suspend and restart the simulated thread while waiting for resources for the operation to complete.

• A set of program structures for initiating simulated threads;
• The macro "STHREAD_INIT" allocates memory for the simulated thread, sets the value of "line" (which later holds values of the C preprocessor symbol __LINE__) to zero, sets the value of "data" to the private stack area of the particular simulated thread, and sets a value for "handler" to a function passed to the macro as an argument.

    // The third parameter is named "fn" rather than "handler" so that the
    // preprocessor does not also substitute the member name in "m->handler".
    #define STHREAD_INIT(m, msg, fn)    \
        m = malloc(sizeof(*m));         \
        msg = zalloc(sizeof(*msg));     \
        m->line = 0;                    \
        m->data = msg;                  \
        m->handler = fn



• A set of program structures for actually performing the simulated thread;

    void
    function(int arg)
    {
        function_msg *msg;
        sthread_msg *m;

        STHREAD_INIT(m, msg, function_sthread);
        msg->arg = arg;

        sthread_run(m);
    }

The program structure above includes, in its definition for the function "function", program code for creating the data blocks for the simulated thread, and for placing data in those data blocks. These are the statements "STHREAD_INIT(m, msg, function_sthread)" and "msg->arg = arg", which make use of a macro defined for the name "STHREAD_INIT".

• A set of program structures for scheduling performance of simulated threads;

    switch (m->line) {    // a field in sthread_msg
    case 0:
        // executable C code
        STHREAD_RESTART_POINT(m);
        // executable C code
        STHREAD_RESTART_POINT(m);
        // executable C code
    }

• A set of program structures for suspending and resuming performance of simulated threads.

    #include <setjmp.h>
    #include <stdlib.h>

    typedef struct sthread_msg sthread_msg;
    struct sthread_msg {
        int line;
        void *data;
        void (*handler)(sthread_msg *);
    };

    jmp_buf sthread_env;

    void
    sthread_run(sthread_msg *m)
    {
        if (!setjmp(sthread_env)) {
            m->handler(m);
            free(m);    // reached only if the handler ran to completion
        }
    }

    void
    sthread_suspend()
    {
        longjmp(sthread_env, 0);    // setjmp() then returns 1, not 0
    }

    sthread_msg *suspended_sthread;
    int ready;

    int
    cond(sthread_msg *m)
    {
        if (ready)
            return 1;
        suspended_sthread = m;
        sthread_suspend();
        return 0;    // not reached: sthread_suspend() does not return
    }

    void
    set_cond()
    {
        ready = 1;
        if (suspended_sthread) {
            sthread_msg *m = suspended_sthread;
            suspended_sthread = 0;
            sthread_run(m);
        }
    }

    // after cond() has changed:
    //     sthread_run(suspended_sthread);

• A set of program structures for performing simulated threads in conjunction with the WAFL file system, as shown above.
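The program structures listed in this section can be combined into one compilable unit, as in the following minimal sketch. Several details are assumptions made for illustration rather than part of the description: the condition passed to "STHREAD_COND_WAIT" is taken to be true when the simulated thread may proceed (so the macro suspends on a false condition), "cond" only records the suspended thread and leaves the suspension to the macro, "calloc" stands in for "zalloc", and the variables "entries" and "last_result" exist only so that the behavior can be observed.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdlib.h>

typedef struct sthread_msg sthread_msg;
struct sthread_msg {
    int line;                          /* restart point, 0 on first entry */
    void *data;                        /* private state of the simulated thread */
    void (*handler)(sthread_msg *);
};

typedef struct { int arg; int sum; } function_msg;  /* "sum" is hypothetical */

static jmp_buf sthread_env;
sthread_msg *suspended_sthread;
int ready, entries, last_result;       /* entries/last_result: demo only */

static void sthread_suspend(void) { longjmp(sthread_env, 1); }

#define STHREAD_START_BLOCK(m)   switch ((m)->line) { case 0:
#define STHREAD_RESTART_POINT(m) case __LINE__: (m)->line = __LINE__
#define STHREAD_END_BLOCK        }
/* Assumption: a true condition means "may proceed"; suspend otherwise. */
#define STHREAD_COND_WAIT(m, c) \
    STHREAD_RESTART_POINT(m);   \
    do { if (!(c)) sthread_suspend(); } while (0)

/* Run (or resume) a simulated thread; free it only if it completed. */
void sthread_run(sthread_msg *m)
{
    assert(m != NULL && m->handler != NULL);
    if (setjmp(sthread_env) == 0) {
        m->handler(m);
        free(m->data);
        free(m);
    }
}

/* Returns 1 if ready; otherwise remembers the thread and returns 0. */
static int cond(sthread_msg *m)
{
    if (ready)
        return 1;
    suspended_sthread = m;
    return 0;
}

static void function_sthread(sthread_msg *m)
{
    function_msg *const msg = m->data;
    entries++;
    STHREAD_START_BLOCK(m);
    msg->sum = msg->arg;               /* runs on first entry only */
    STHREAD_COND_WAIT(m, cond(m));     /* may suspend here */
    last_result = msg->sum + 1;        /* runs once the condition holds */
    STHREAD_END_BLOCK;
}

/* Create and start a simulated thread; returns -1 if allocation fails. */
int function(int arg)
{
    sthread_msg *m = malloc(sizeof *m);
    function_msg *msg = calloc(1, sizeof *msg);
    if (m == NULL || msg == NULL) {
        free(m);
        free(msg);
        return -1;
    }
    m->line = 0;
    m->data = msg;
    m->handler = function_sthread;
    msg->arg = arg;
    sthread_run(m);
    return 0;
}

/* Signal the condition and resume the suspended thread, if any. */
void set_cond(void)
{
    ready = 1;
    if (suspended_sthread != NULL) {
        sthread_msg *m = suspended_sthread;
        suspended_sthread = NULL;
        sthread_run(m);
    }
}
```

In this sketch the value stored in "sum" before the suspension is still available after the resumption because it lives in the message data rather than in a local variable of "function_sthread".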
Generality of the Invention

The invention has general applicability to various fields of use, not necessarily related to the services described above. For example, these fields of use can include devices other than file servers.

Other and further applications of the invention in its most general form will be clear to those skilled in the art after perusal of this application, and are within the scope and spirit of the invention.

Technical Appendix

The technical appendix enclosed with this application is hereby incorporated by reference as if fully set forth herein, and forms a part of the disclosure of the invention and its preferred embodiments.

Alternative Embodiments

Although preferred embodiments are disclosed herein, many variations are possible which remain within the concept, scope, and spirit of the invention, and these variations would become clear to those skilled in the art after perusal of this application.